README

Paper title: Early Joiners and Startup Performance
Authors: Choi, Goldschlag, Haltiwanger, Kim
Publication Date: 20230829

This document does not contain confidential information. The programs discussed in this document and made available to researchers have been reviewed to ensure that no confidential information is included in the code. The outputs correspond to disclosure review numbers DRB-B0043-CED-20190418, DRB-B0049-CED-20190503, CBDRB-FY19-398, CBDRB-FY21-CES007-001, CBDRB-FY22-CES008-002, CBDRB-FY23-07, CBDRB-FY23-CES007-04 (DMS P-7517031).

All the empirical results in the paper use confidential microdata from the U.S. Census Bureau. The analyses require access to the following confidential microdata that are archived under the DMS project 7517031:

    * spd_founding_team_5yr.dta - Derived from the startup panel database; contains 5-year outcomes of startups. Created in Step 3, described below.
    * spd_jcr.dta - Startup panel database, contains annual observations for startups.
    * public data inputs described in Section II below.
    * skill_intensity.dta - College educated worker industry shares, created in Step 3.
    * hpsector_1990_2015.dta - Hurst-Pugsley industry indicators, created in Step 3.
    * nondirectional_hct_naics.dta - Human capital transferability measures, created in Step 3.

As described below, these confidential databases are derived by integrating the following confidential databases:
    * Longitudinal Business Database (including revenue) (1990-2015)
    * Longitudinal Employer-Household Dynamics data (1990-2015, ICF, ECFT26, PHF interleave files)
    * Business Register (EINUNITS, SSNUNITS, SSEL) (1990-2015)
    * American Community Survey data (2001-2017)
    * 2000 Decennial Hundred-Percent Detail File
    * Census NUMIDENT

We have also archived the code to generate the analytic databases from the above files. This code is not releasable since it includes confidential information. The archive for this project includes both the analytic databases referred to above and the data infrastructure construction code. We have provided a sanitized version of the code used to produce the analytic results in this release.   

To gain access to the Census microdata, follow the directions here on how to write a proposal for access to the data via a Federal Statistical Research Data Center (FSRDC): https://www.census.gov/programs-surveys/ces/data/restricted-use-data/apply-for-access.htm

I. Directory Structure

./in_confidential/ - datasets with confidential data (users will need access to these data).
./in_public/ - datasets containing only public information.
./out/ - output files with summary statistics and regression estimates.
    ./out/tables/ - tex tables for the paper. 
    ./out/figures/ - png figures for the paper. 
./programs/ - program files used to create data infrastructure and execute analysis. 

II. Public Data Inputs
    * acpi.sas7bdat, cpi.sas7bdat, ipindex_updated.sas7bdat - CPI-U-RS indices.
    * delgado_1998_2013.dta - Data from Delgado, Mercedes and Karen G Mills (2020) "The supply chain economy: A new industry categorization for understanding innovation in services," Research Policy, 49(8).
    * oes2007_soc2000dd_task_alm.csv - Combination of data from the BLS OES industry occupation matrix for 2007 and Autor, David and David Dorn (2013) "The growth of low-skill service jobs and the polarization of the US labor market," American Economic Review, 103(5), 1553-97.
    * stemunion07.dta - Data from Goldschlag, Nathan and Javier Miranda (2020) "Business dynamics statistics of High Tech industries," Journal of Economics & Management Strategy, 29(1), 3-30.

III. Replication Steps

The code for Steps 1-3 contains confidential information. That code can be requested by approved projects via the FSRDC system. We provide a description of those programs below. This repository contains the code for Step 4, which prepares the regression panels, computes summary statistics, estimates regressions, and generates figures and tables for the paper.

Step 1. Create Founding Team Database 

    Create the founding team database, founding_team_fyjoiner.sas7bdat. We follow similar steps to create a second-year joiner dataset. To create the list of startup firm IDs from the LBD and their associated SEINs (LEHD employer identifiers), researchers need to merge LBD firm IDs and SEINs using the FAS_EIN variable available in the ECF Interleave SEINUNIT Title 26 file. Note that the LBD measures employment in the week of March 12 every year (i.e., in the first quarter), and a firm is classified as a startup in the year it first appears with positive employment in the LBD. Therefore, a startup in year t can have employees in the LEHD in the first quarter of year t as well as in the second, third, and fourth quarters of year t-1. Using the PHF Interleave file, we identify the initial quarter with a positive number of employees for each startup in the LBD, and we define founding team members as individuals who have positive earnings within the first four quarters of operation.

    Because sole proprietorship owners do not appear in the LEHD, we obtain their information from the BR EINUNITS and SSNUNITS files (for 2002-2015) as well as from the SSEL files (for 1990-2001). We obtain the age, race, gender, and education of founding team members (including sole proprietorship owners) from the LEHD ICF file. Whenever those variables are imputed or missing in the ICF for a given individual, we obtain the corresponding information from the ACS files. We collect date of death from the Census NUMIDENT data.

    The resulting file, founding_team_fyjoiner.sas7bdat, is constructed at the firm ID and individual ID level. It contains quarterly earnings for each individual at the firm, the individual's demographics and educational attainment, a founder/early joiner indicator, and information about the jobs (e.g., earnings, industry, employer characteristics) the individual held before joining the startup.
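
    The founding team rule in Step 1 (first quarter with positive employment, then everyone with positive earnings in the first four quarters of operation) can be illustrated with a minimal Python sketch. It is not the confidential SAS code: the record layout (firm_id, person_id, year, quarter, earnings), the toy data, and the integer quarter index are all assumptions made for illustration.

```python
from collections import defaultdict


def quarter_index(year, qtr):
    """Map (year, quarter 1-4) to a consecutive integer so quarters can be compared."""
    return year * 4 + (qtr - 1)


def founding_teams(records, window=4):
    """Return {firm_id: set of person_ids} for individuals with positive
    earnings within the first `window` quarters of the firm's operation.

    `records` is an iterable of (firm_id, person_id, year, qtr, earnings).
    The first quarter of operation is taken to be the earliest quarter in
    which anyone at the firm has positive earnings (illustrative assumption).
    """
    paid = defaultdict(list)  # firm_id -> [(quarter index, person_id), ...]
    for firm_id, person_id, year, qtr, earnings in records:
        if earnings > 0:
            paid[firm_id].append((quarter_index(year, qtr), person_id))

    teams = {}
    for firm_id, rows in paid.items():
        first_q = min(q for q, _ in rows)
        teams[firm_id] = {p for q, p in rows if q < first_q + window}
    return teams


# Hypothetical toy records; none of these values come from Census data.
records = [
    ("F1", "A", 2004, 3, 9000.0),  # first quarter of operation
    ("F1", "B", 2005, 2, 7000.0),  # fourth quarter of operation: included
    ("F1", "C", 2005, 3, 5000.0),  # fifth quarter of operation: excluded
    ("F1", "D", 2004, 2, 0.0),     # zero earnings: ignored
]
teams = founding_teams(records)
print(teams)
```

    In this toy example only A and B fall inside the four-quarter window, so they form the founding team of firm F1.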

Step 2. Create Startup Panel Database

    Create the startup panel dataset, spd_jcr.sas7bdat, a panel database that tracks the activities (e.g., employment, payroll, revenue, entry, exit) of startups on an annual basis. Because firm identifiers in the LBD can change for various reasons, we construct longitudinally consistent firm identifiers for startups by tracking establishment flows. Specifically, we track single-to-multi establishment expansions, multi-to-single establishment contractions, and multi-to-multi establishment transitions. We then bring in industry, state, firm age, annual employment, quarterly and annual payroll, entry, exit, and annual revenue from the LBD.

Step 3. Create Analysis Files

    In this step we create the files needed for the analysis: the 5-year summary outcomes for firms, the death matched samples, the non-directional human capital transferability measures, and the skill intensity measures.
        * For the 5-year outcomes, we use the founding_team_fyjoiner file to calculate summary statistics of founding teams at the startup level, such as average prior earnings, male share, college graduate share, and average age. We then use the spd_jcr file to calculate size at firm ages zero and five and survival status at firm age five.
        * For the death matched samples, both for first-year joiners and second-year joiners, we use the founding_team_fyjoiner file to extract the list of individuals who died prematurely when the firm was age 10 or younger. Call firms hit by a founding team member death shock “treated” firms. Drop firms from “treated” if two or more founding team members died within the first 10 years of operation, to ensure that each firm was treated only once. Use the founding_team_fyjoiner file to calculate founding team characteristics in each quarter of operation, such as the number of team members remaining at the firm, their average age, and their average prior earnings. Take all startups that were never treated by a premature member death shock and call them potential controls. For each firm treated in a given year-quarter, find a control firm by matching on (i) state, (ii) legal form of organization, and (iii) the number of founding team members remaining and their age, using the coarsened exact matching algorithm. Once a control firm is matched to a treated firm, remove it from the potential controls to obtain one-to-one matching without replacement. Merge the matched pairs with the spd_jcr file to construct a firm panel dataset that has firm age, state, industry, employment, revenue, survival status, a treated/control indicator, characteristics of the deceased person, etc.
        * For the skill intensity measures, we use the PHF Interleave file to get the list of individuals with positive earnings for each SEIN in each quarter. Merge the file with the ECF to get the 2007 NAICS code for each SEIN. Merge the file by individual ID with the ICF to get individuals' educational attainment. Keep non-imputed records. Calculate the share of college graduates in each four-digit NAICS industry in each quarter, then average the share across time to obtain a time-invariant skill intensity at the four-digit NAICS level.
        * For the human capital transferability measures, we use the PHF Interleave file to get the list of individuals with full-quarter jobs in each quarter. If a person has multiple full-quarter jobs, keep only the one with the highest earnings (i.e., the main job). Identify individuals who switch full-quarter jobs (i.e., SEINs) from quarter t to t+2, t+3, or t+4. Therefore, we allow previous and new jobs to have at most one quarter of overlap. Merge the file with the ECF to get the 2007 NAICS codes for the previous and new SEINs. Calculate the number of individuals who switch jobs from industry A to B divided by the number of individuals who switch jobs from industry A to any industry (including industry A itself). Similarly, calculate the number of individuals who switch jobs from industry B to A divided by the number of individuals who switch jobs from industry B to any industry. Take the average of the two values to get the nondirectional human capital transferability between industries A and B in each year. Take the average value of human capital transferability across all years. Calculate this measure for all possible pairs of four-digit NAICS industries.
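
    The nondirectional human capital transferability calculation in Step 3 can be written out as a short Python sketch. It is a sketch under stated assumptions, not the confidential code: the input is assumed to be one (year, origin industry, destination industry) tuple per job switcher, an industry pair is skipped in any year where either industry has no switchers (the README does not say how such cases are handled), and the across-year average is taken over the years in which the pair is defined.

```python
from collections import Counter, defaultdict
from itertools import combinations_with_replacement


def nondirectional_hct(switches):
    """Average over years of 0.5 * [N(A->B)/N(A->any) + N(B->A)/N(B->any)].

    `switches` is an iterable of (year, origin_industry, dest_industry),
    one tuple per job switcher. Returns {(a, b): value} with a <= b.
    """
    flows = defaultdict(Counter)     # year -> Counter of (origin, dest) switchers
    outflows = defaultdict(Counter)  # year -> Counter of switchers leaving origin
    for year, origin, dest in switches:
        flows[year][(origin, dest)] += 1
        outflows[year][origin] += 1

    yearly = defaultdict(list)  # (a, b) -> one transferability value per year
    for year, out in outflows.items():
        # Only industries with at least one switcher appear in `out`,
        # so both denominators below are strictly positive.
        for a, b in combinations_with_replacement(sorted(out), 2):
            a_to_b = flows[year][(a, b)] / out[a]
            b_to_a = flows[year][(b, a)] / out[b]
            yearly[(a, b)].append((a_to_b + b_to_a) / 2)

    return {pair: sum(vals) / len(vals) for pair, vals in yearly.items()}


# Hypothetical toy flows between two industries labeled "A" and "B".
switches = (
    [(2010, "A", "B")] * 2 + [(2010, "A", "A")] * 2
    + [(2010, "B", "A")] * 1 + [(2010, "B", "B")] * 3
    + [(2011, "A", "B")] * 1 + [(2011, "A", "A")] * 1
    + [(2011, "B", "A")] * 1 + [(2011, "B", "B")] * 1
)
hct = nondirectional_hct(switches)
print(hct[("A", "B")])
```

    In the toy data the yearly values for the pair (A, B) are (0.5 + 0.25) / 2 = 0.375 in 2010 and (0.5 + 0.5) / 2 = 0.5 in 2011, so the time-averaged measure is 0.4375.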

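    The one-to-one matching without replacement described in Step 3 for the death matched samples can be sketched as follows. This is an illustration only: the bin edges used to coarsen team size and average age, the field names, and the rule of taking the first available control in input order are assumptions, and the sketch ignores the year-quarter timing of the match.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical bin edges used to coarsen the continuous matching variables.
SIZE_CUTS = (2, 5, 10)
AGE_CUTS = (30, 40, 50)


def stratum(firm):
    """Coarsened matching cell: exact on state and legal form, binned on size and age."""
    return (
        firm["state"],
        firm["lfo"],
        bisect_right(SIZE_CUTS, firm["team_size"]),
        bisect_right(AGE_CUTS, firm["avg_age"]),
    )


def match_without_replacement(treated, controls):
    """Return {treated firm_id: control firm_id}, using each control at most once."""
    pool = defaultdict(list)  # stratum -> control firm_ids still available
    for c in controls:
        pool[stratum(c)].append(c["firm_id"])

    pairs = {}
    for t in treated:
        candidates = pool.get(stratum(t))
        if candidates:
            # Taking the first candidate removes it from the pool for later firms.
            pairs[t["firm_id"]] = candidates.pop(0)
    return pairs


# Hypothetical toy firms; none of these values come from Census data.
treated = [
    {"firm_id": "T1", "state": "CA", "lfo": "corp", "team_size": 3, "avg_age": 35},
    {"firm_id": "T2", "state": "CA", "lfo": "corp", "team_size": 4, "avg_age": 38},
    {"firm_id": "T3", "state": "TX", "lfo": "llc", "team_size": 1, "avg_age": 45},
]
controls = [
    {"firm_id": "C1", "state": "CA", "lfo": "corp", "team_size": 3, "avg_age": 33},
    {"firm_id": "C2", "state": "CA", "lfo": "corp", "team_size": 4, "avg_age": 39},
    {"firm_id": "C3", "state": "CA", "lfo": "corp", "team_size": 8, "avg_age": 36},
]
pairs = match_without_replacement(treated, controls)
print(pairs)
```

    T1 and T2 share a cell and receive different controls, C3 falls in a different size bin and is never used, and T3 has no control in its cell and stays unmatched.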

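    The skill intensity measure described in Step 3 (college graduate share by four-digit NAICS industry and quarter, then averaged across time) can be sketched in the same way. The input layout and the unweighted average across quarters are assumptions made for illustration; the README does not state whether quarters are weighted.

```python
from collections import defaultdict


def skill_intensity(jobs):
    """Return {naics4: mean across quarters of the college graduate share}.

    `jobs` is an iterable of (quarter, naics4, is_college_grad), one tuple per
    worker with positive earnings and non-imputed education in that quarter.
    """
    counts = defaultdict(lambda: [0, 0])  # (naics4, quarter) -> [grads, workers]
    for quarter, naics4, is_college_grad in jobs:
        cell = counts[(naics4, quarter)]
        cell[0] += int(is_college_grad)
        cell[1] += 1

    shares = defaultdict(list)  # naics4 -> one college share per quarter
    for (naics4, _quarter), (grads, workers) in counts.items():
        shares[naics4].append(grads / workers)

    # Unweighted mean across quarters (illustrative assumption).
    return {naics4: sum(v) / len(v) for naics4, v in shares.items()}


# Hypothetical toy records with made-up industry codes.
jobs = (
    [("2010Q1", "5415", True), ("2010Q1", "5415", False)]
    + [("2010Q2", "5415", True)] * 3 + [("2010Q2", "5415", False)]
    + [("2010Q1", "7225", True)] + [("2010Q1", "7225", False)] * 3
)
intensity = skill_intensity(jobs)
print(intensity)
```

    Industry 5415 has shares of 0.5 and 0.75 in its two quarters, giving a time-invariant intensity of 0.625; industry 7225 appears in one quarter with a share of 0.25.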
Step 4. Execute Analysis 

    Execute r1_gen_output.do followed by r2_make_tables_figures.do.
    
    0_config.do
        * Contains parameters and macros common to the Stata programs.
    
    r1_gen_output.do 
        * Prepares regression panels, computes summary and descriptive statistics, estimates regressions.
        Inputs: 
            * spd_founding_team_5yr.dta - Derived from the startup panel database; contains 5-year outcomes of startups. Created in Step 3.
            * spd_jcr.dta - Startup panel database, contains annual observations for startups.
            * public data inputs described above. 
            * skill_intensity.dta - College educated worker industry shares, created in Step 3.
            * hpsector_1990_2015.dta - Hurst-Pugsley industry indicators, created in Step 3.
            * nondirectional_hct_naics.dta - Human capital transferability measures, created in Step 3.
        Outputs:
            * ./out/dhs_emp.xlsx - Contains estimation output where the LHS variable is dhs(emp).
            * ./out/dhs_rev.xlsx - Contains estimation output where the LHS variable is dhs(rev).
            * ./out/extras.xlsx - Contains additional estimation output and summary statistics. 
            * ./out/readme.txt - Contains a listing of what output is contained on each sheet of the xlsx files.  
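
    The outputs above name their dependent variables dhs(emp) and dhs(rev) without defining dhs(). Assuming it denotes the Davis-Haltiwanger-Schuh growth rate (an assumption; this README does not say so, and the horizon used in the paper is not stated here), the transformation can be sketched as:

```python
def dhs_growth(x_start, x_end):
    """DHS growth rate: change divided by the average of the two levels.

    The measure is symmetric and bounded in [-2, 2]: an exit (x_end == 0)
    gives -2 and an entry (x_start == 0) gives 2. Returns None when both
    levels are zero, where the rate is undefined.
    """
    denom = 0.5 * (x_start + x_end)
    if denom == 0:
        return None
    return (x_end - x_start) / denom


# Hypothetical employment levels at the start and end of the horizon.
print(dhs_growth(10, 30))
print(dhs_growth(10, 0))
```

    Unlike a log difference, this rate stays defined when a firm exits, which is why it suits outcomes that combine growth and survival.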
    
    r2_make_tables_figures.do 
        * Converts information in the xlsx files in ./out/ into tex tables and png figures. 
        Inputs: 
            * ./out/dhs_emp.xlsx - Contains estimation output where the LHS variable is dhs(emp).
            * ./out/dhs_rev.xlsx - Contains estimation output where the LHS variable is dhs(rev).
            * ./out/extras.xlsx - Contains additional estimation output and summary statistics. 
        Outputs:
            * ./out/tables/*.tex - Table code for the paper. 
            * ./out/figures/*.png - Figures for the paper. 
